Weather in AUS

Hila Dar, Feb 2022

  1. Introduction
  2. Data Analysis and Data Visualization
  3. Preprocessing
  4. Feature Creation / Selection
  5. Random Forest - Training

Introduction

Problem definition

goal: Predict next-day rainfall in Australia using location-specific data.

data: 10 years of daily weather observations from different locations. These observations have been taken from the Bureau of Meteorology's "real time" system. Most of the data are generated and handled

model: random forest

summary

  1. Encoding wind direction: I had to consider how to encode angle values (0-360 degrees) in a continuous and periodic way (epsilon degrees should be near -epsilon degrees). I chose to map the angle to the pair $f(\theta) = (\sin(\theta), \cos(\theta))$.

  2. Morning and afternoon: The data contains morning and afternoon features, which are inherently correlated. If $x_{AM}$ and $x_{PM}$ are the morning and afternoon fields, I chose to apply the following map: $(x_{AM}, x_{PM}) \to (x_{PM}, x_{PM}-x_{AM})$. In that sense, I only used the uncorrelated component of the morning fields.

  3. Missing values: I used a three-stage process to fill missing values. First, using the strong autocorrelation of the fields, I completed the missing values with the mean of the previous n days (if possible). Next, using the seasonality of the fields, I filled the remaining missing values with the mean value of the field in the same month and location. Finally, I dropped the remaining NaN values.

  4. Objective function: I care more about getting the rain class right (precision and recall on rainy days) than about overall accuracy, and hence optimize for the f1-score rather than accuracy.

  5. Training & selection: I trained a random forest using randomized grid search. I used oversampling to overcome the class imbalance. I chose the optimal parameters for the model using CV and chose the prediction threshold that maximizes the f1-score.

==== Test ====

          precision    recall  f1-score   support

 no_rain       0.92      0.88      0.90      8857
    rain       0.63      0.72      0.67      2463

accuracy                           0.85     11320

Libraries

Loading the Data

Exploratory data analysis & Data Visualization

Simple Data Analysis

Features Dictionary

| Feature | Description | Units |
|---|---|---|
| Date | The date of observation | yyyy-mm-dd |
| Location | The common name of the location of the weather station | |
| MinTemp | Minimum temperature in the 24 hours to 9am. Sometimes only known to the nearest whole degree. | degrees Celsius |
| MaxTemp | Maximum temperature in the 24 hours from 9am. Sometimes only known to the nearest whole degree. | degrees Celsius |
| Rainfall | Precipitation (rainfall) in the 24 hours to 9am. Sometimes only known to the nearest whole millimetre. | millimetres |
| Evaporation | "Class A" pan evaporation in the 24 hours to 9am | millimetres |
| Sunshine | The number of hours of bright sunshine in the day | hours |
| WindGustDir | Direction of strongest gust in the 24 hours to midnight | 16 compass points |
| WindGustSpeed | Speed of strongest wind gust in the 24 hours to midnight | kilometres per hour |
| WindDir9am | Wind direction averaged over 10 minutes prior to 9 am | 16 compass points |
| WindDir3pm | Wind direction averaged over 10 minutes prior to 3 pm | 16 compass points |
| WindSpeed9am | Wind speed averaged over 10 minutes prior to 9 am | kilometres per hour |
| WindSpeed3pm | Wind speed averaged over 10 minutes prior to 3 pm | kilometres per hour |
| Humidity9am | Relative humidity at 9 am | percent |
| Humidity3pm | Relative humidity at 3 pm | percent |
| Pressure9am | Atmospheric pressure reduced to mean sea level at 9 am | hectopascals |
| Pressure3pm | Atmospheric pressure reduced to mean sea level at 3 pm | hectopascals |
| Cloud9am | Fraction of sky obscured by cloud at 9 am | eighths |
| Cloud3pm | Fraction of sky obscured by cloud at 3 pm | eighths |
| Temp9am | Temperature at 9 am | degrees Celsius |
| Temp3pm | Temperature at 3 pm | degrees Celsius |
| RainToday | 1 if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise 0 | Boolean |
| RainTomorrow | The feature we are trying to predict: 1 if precipitation (mm) in the next 24 hours to 9am exceeds 1mm, otherwise 0 | Boolean |

Which features are categorical? Which features are numerical?

Which features contain blank, null or empty values?

Analyze by visualizing data

Features Distribution & Joint Distribution

Features Correlation

Location analysis

Time series analysis

Preprocessing

Completing

Next we would like to complete the missing values in our data.

To counter all of the problems above, I devised a three-stage process:

  1. First, fill the missing values with the mean of the values from the previous n days for the same feature and location. This approach exploits the slow day-to-day variation of the data (see autocorrelation below) and works only if at least one value exists in the n days preceding the missing data point.
  2. Second, using the data seasonality, fill the missing values with the historical monthly average at the same location.
  3. Finally, if a data point is still missing after steps 1-2, we conclude we don't have enough data to interpolate its value, and this data point is dropped.
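
The three stages above could be sketched with pandas roughly as follows. This is a minimal sketch, not the notebook's original code: the function name `fill_missing`, the default `n_days=3`, and the assumption that each location has one row per consecutive day (so "previous n rows" means "previous n days") are all illustrative choices. Both the rolling mean and the monthly mean are computed from the observed values only.

```python
import pandas as pd

def fill_missing(df, column, n_days=3):
    """Three-stage fill for one numeric column.

    Assumes df has a datetime 'Date' column and a 'Location' column,
    with one row per consecutive day at each location.
    """
    df = df.sort_values(["Location", "Date"]).copy()
    observed = df[column]

    # Stage 1: mean of the previous n_days rows at the same location.
    # shift(1) excludes the current day; min_periods=1 means a single
    # observed value in the window is enough.
    prev_mean = observed.groupby(df["Location"]).transform(
        lambda s: s.shift(1).rolling(n_days, min_periods=1).mean()
    )

    # Stage 2: historical mean for the same location and calendar month.
    month = df["Date"].dt.month.rename("month")
    monthly_mean = observed.groupby([df["Location"], month]).transform("mean")

    df[column] = observed.fillna(prev_mean).fillna(monthly_mean)

    # Stage 3: drop the rows that could not be filled.
    return df.dropna(subset=[column])
```

Applying it column by column, for example `fill_missing(data, "Humidity3pm")`, keeps the stages independent for each feature; whether rows should be dropped per feature or only once at the end is a design choice this sketch does not settle.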

Correcting

First we transform the categorical wind-direction data into numeric features. Ideally, this transformation $f: [0, 2\pi] \rightarrow\mathbb{R}^{n}$ should have three important properties:

  1. $f$ is continuous.
  2. $f(0)=f(2\pi)$
  3. $f$ is one-to-one on $[0, 2\pi)$, i.e. if $\theta_1 \neq \theta_2$ then $f(\theta_1) \neq f(\theta_2)$.

I chose to apply the following map:

$f(\theta) = (\sin(\theta), \cos(\theta))$
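
As an illustration of this map applied to the 16 compass points, here is a minimal sketch using only the standard library. The ordering of the points (clockwise from north, with N at angle 0) and the helper names `encode_direction` and `distance` are assumptions made for the example; the map itself has the three properties for any consistent ordering.

```python
import math

# The 16 compass points, clockwise from north (assumed convention).
COMPASS_POINTS = [
    "N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
    "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW",
]

def encode_direction(point):
    """Map a compass point to (sin(theta), cos(theta))."""
    index = COMPASS_POINTS.index(point)
    theta = 2 * math.pi * index / len(COMPASS_POINTS)
    return (math.sin(theta), math.cos(theta))

def distance(a, b):
    """Euclidean distance between two encoded directions."""
    return math.dist(encode_direction(a), encode_direction(b))

print(encode_direction("N"))   # (0.0, 1.0)
# NNW and NNE are both one step away from N, so they are equally close
# to it, even though their raw angles (337.5 and 22.5 degrees) are far apart.
print(distance("N", "NNW"), distance("N", "NNE"))
```

Missing wind directions are not handled here; they would need to be filled or dropped before encoding.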

Feature Creation / Selection

  1. As mentioned above, the morning and afternoon variables are correlated, hence we want to avoid using both of them. I chose to use the afternoon values, as they show a higher correlation with our y label. Additionally, we add the difference between the afternoon and morning values, to incorporate the information in the morning features without using correlated features.

  2. Since RainToday is a direct function of Rainfall (i.e. RainToday = 1 if Rainfall > 1, else 0), we drop RainToday, the less informative of the two features.
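
Both steps can be sketched with pandas as below. The list of morning/afternoon pairs is read off the features dictionary, and the `...Diff` column names are a hypothetical naming scheme chosen for this example. The wind-direction pair is left out because it is categorical before encoding, and the text does not say how it was treated.

```python
import pandas as pd

# Numeric morning/afternoon pairs from the features dictionary.
AM_PM_PAIRS = [
    ("WindSpeed9am", "WindSpeed3pm"),
    ("Humidity9am", "Humidity3pm"),
    ("Pressure9am", "Pressure3pm"),
    ("Cloud9am", "Cloud3pm"),
    ("Temp9am", "Temp3pm"),
]

def add_am_pm_differences(df):
    """Apply (x_AM, x_PM) -> (x_PM, x_PM - x_AM) and drop RainToday."""
    out = df.copy()
    for am, pm in AM_PM_PAIRS:
        if am in out.columns and pm in out.columns:
            # Hypothetical naming: "Humidity3pm" -> "HumidityDiff".
            out[pm.replace("3pm", "Diff")] = out[pm] - out[am]
            out = out.drop(columns=am)
    # RainToday is derived from Rainfall, so it adds no information.
    return out.drop(columns="RainToday", errors="ignore")
```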

A few more ideas I had:

Random Forest

Next we train a random forest model to predict next-day rainfall.

Benchmark model

We set a benchmark model to better evaluate our model's performance. Due to data imbalance, a simple "always predict no-rain" model has 78% accuracy. To maximize true positives and minimize false positives, we optimize all models for precision and f1-score.
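
To make the benchmark concrete, the short calculation below uses the class counts from the test report in the summary (8857 no-rain days, 2463 rain days). It assumes the test split has roughly the same class balance as the full data, which the report's support column suggests but the text does not state.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

no_rain, rain = 8857, 2463  # support counts from the test report

# "Always predict no-rain" is right on every no-rain day and wrong otherwise.
baseline_accuracy = no_rain / (no_rain + rain)

# It never predicts rain, so it has no true positives on the rain class.
baseline_rain_f1 = f1(0.0, 0.0)

# Rain-class f1 of the trained model, from its reported precision and recall.
model_rain_f1 = f1(0.63, 0.72)

print(round(baseline_accuracy, 2))  # 0.78
print(baseline_rain_f1)             # 0.0
print(round(model_rain_f1, 2))      # 0.67
```

The baseline looks strong on accuracy but scores zero on the rain class, which is why accuracy alone is a poor yardstick here.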

Parameters Tuning

Parameter choices:

search results:

Final Model
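
The last selection step, choosing the prediction threshold that maximizes the f1-score, could look roughly like the sketch below. It is written in plain Python for clarity and is not the notebook's original code: the function names, the grid of candidate thresholds (0.01 to 0.99), and the toy data are all assumptions. In practice `proba` would be the forest's predicted rain probabilities on held-out validation data, not on the test set.

```python
def f1_at_threshold(y_true, proba, threshold):
    """f1-score of the positive class when predicting 1 iff proba >= threshold."""
    pred = [p >= threshold for p in proba]
    tp = sum(1 for y, yhat in zip(y_true, pred) if y == 1 and yhat)
    fp = sum(1 for y, yhat in zip(y_true, pred) if y == 0 and yhat)
    fn = sum(1 for y, yhat in zip(y_true, pred) if y == 1 and not yhat)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, proba, candidates=None):
    """Return the candidate threshold with the highest f1-score."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    return max(candidates, key=lambda t: f1_at_threshold(y_true, proba, t))

# Toy example: 1 = rain, 0 = no rain.
y_true = [0, 0, 0, 1, 0, 1, 1]
proba = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.8]
t = best_threshold(y_true, proba)
print(t, f1_at_threshold(y_true, proba, t), f1_at_threshold(y_true, proba, 0.5))
```

On this toy data the selected threshold falls below the default 0.5 and yields a higher f1-score, the kind of trade of precision for recall that is common on an imbalanced problem.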